## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
This project uses exploratory data analysis (EDA) techniques to explore relationships ranging from a single variable to multiple variables. I selected the red wine dataset to explore visualizations, distributions, outliers, and anomalies.
The main question is: "Which chemical properties influence the quality of red wines?" My goal is therefore to find out which chemical properties influence the quality of red wines, applying EDA techniques in the R programming language.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Bad Average Good perfect
## 63 681 638 217
The above result shows that the large majority of the wines (1518 of 1599) were rated between 5 and 7, with ratings of 5 and 6 alone accounting for 1319 wines.
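As a rough check on that statement, the shares can be recomputed directly from the counts printed in the table above. This is only a sketch: the variable names `quality_counts`, `share`, and `mid_share` are illustrative and not taken from the original analysis.

```r
# Rating counts copied from the table output above (ratings 3 to 8)
quality_counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

total <- sum(quality_counts)        # total number of wines
share <- quality_counts / total     # proportion of wines per rating

# Combined share of wines rated 5, 6 or 7
mid_share <- sum(share[c("5", "6", "7")])
round(mid_share, 3)
```

Working from the printed counts keeps the check independent of how the data frame was loaded.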
This graph shows that fixed acidity ranges from a minimum of 4.60 to a maximum of 15.90. The distribution is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity is approximately normally distributed, with a mean of about 0.53, a minimum of 0.12, and a maximum of 1.58.
This graph shows that free sulfur dioxide has a minimum of 1, a maximum of 72, and a mean of about 15. The distribution is right-skewed, and the majority of free.sulfur.dioxide values fall between 1 and 40.
Density is normally distributed. The graph is bell-shaped.
pH is normally distributed. The graph is bell-shaped, and most of the data gather around the mean (about 3.3).
There are 1599 observations in this dataset with 12 variables. Density and pH are normally distributed. Free sulfur dioxide is right-skewed: its mean is about 15, its minimum is 1, and its maximum is 72, which demonstrates a long tail. In addition, fixed and volatile acidity, the sulfur dioxides, sulphates, and alcohol also seem to be long-tailed.
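Skewness can also be checked numerically rather than only by eye. The sketch below uses only the first ten `total.sulfur.dioxide` values visible in the `str()` output, so its result says nothing definitive about the full 1599 rows. `skewness_moment` is a hypothetical helper written for this illustration, not a function from the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (positive = longer right tail)
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# First ten values of total.sulfur.dioxide, copied from the str() output
tsd <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

mean(tsd) > median(tsd)   # a mean above the median hints at a right skew
skewness_moment(tsd)      # a positive value indicates a longer right tail
```

On the full data frame the same helper could be applied column by column, for example with `sapply()`.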
Our main focus here is the quality of the wine. Apart from that, we will also look at how the other variables affect quality. The basic characteristics of a wine are sweetness, acidity, tannin, fruit, and alcohol content. While our dataset does not have all of these features, I will look into other features that might be important in the process of rating a wine.
As mentioned above, I want to explore major features such as alcohol content, acidity, pH, and sugars, and then investigate how they affect the quality of the wine.
The histograms reveal the following observations:
- Density and pH are normally distributed.
- Fixed and volatile acidity, free and total sulfur dioxide, sulphates, and alcohol are long-tailed.
I created a new variable named quality_rank to rank the quality of the wine on four levels: Bad, Average, Good, and perfect.
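One way such a variable could be built is with `cut()`. This is a sketch, not necessarily how the original was created: the cut points are an assumption inferred from the two tables above (63 = 10 + 53 for Bad, 217 = 199 + 18 for perfect), and the input is only the first ten quality scores from the `str()` output.

```r
# Quality scores of the first ten wines, copied from the str() output
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed bins: 3-4 = Bad, 5 = Average, 6 = Good, 7-8 = perfect
quality_rank <- cut(quality,
                    breaks = c(2, 4, 5, 6, 8),
                    labels = c("Bad", "Average", "Good", "perfect"),
                    ordered_result = TRUE)

table(quality_rank)
```

Making the factor ordered keeps the levels in a sensible sequence when it is later used as a color or facet variable.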
Stacked histograms of the variables with respect to quality did not reveal much information, except that alcohol strongly affects wine quality. The correlation scatterplots showed a strong positive correlation of alcohol with quality and a strong negative correlation between volatile acidity and quality. This leads to a general observation: good wines contain higher alcohol content, higher citric acid, and lower volatile acidity. There is also a strong positive relationship between density and fixed.acidity. On the other hand, pH has a negative relationship with fixed acidity.
I observed interesting relationships between the other features, such as:
- Fixed acidity vs citric acid: 0.67
- Fixed acidity vs density: 0.67
- Fixed acidity vs pH: -0.68
- Volatile acidity vs citric acid: -0.55
The relationship between density and fixed acidity was among the strongest (0.67). Citric acid and fixed acidity also showed a strong positive correlation (0.67), whereas pH and fixed acidity showed a strong negative correlation (-0.68).
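Pairwise correlations like these can be computed with `cor()`. The sketch below rebuilds a small data frame from the first ten rows shown in the `str()` output, so the coefficients it prints will differ from the full-dataset values quoted above. Density is left out because its values are truncated in that output, and the name `wine_head` is illustrative.

```r
# First ten rows of four columns, copied from the str() output above
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix for every pair of columns
cor_matrix <- cor(wine_head, method = "pearson")

# Correlations of fixed acidity with the other variables
round(cor_matrix["fixed.acidity", ], 2)
```

On the complete data frame, the same call would return the full 12 x 12 matrix from which individual pairs can be read off.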
In the above plots, darker points indicate better-quality wines. High citric acid and low acetic acid (volatile acidity) seem to be a good combination for a quality wine.
After exploring many multivariate plots, I can say that a high-quality wine tends to show combinations such as:
- High alcohol content and high sulphate level
- High alcohol content and low volatile acidity
While looking for interesting multivariate plots, I created three plots. In the volatile acidity vs alcohol plot, I added quality_rank as the color, and a very interesting plot emerged. There were clusters in the plot: high-quality wines had low volatile acidity and high alcohol values, whereas mid- and low-quality wines had higher volatile acidity and lower alcohol values. It was a big surprise for me, since I did not expect such a plot.
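A numeric companion to such a plot is the per-rank mean of each variable. The sketch below uses only the first ten rows from the `str()` output, with rank bins assumed from the tables earlier, so it illustrates the mechanics rather than reproducing the pattern in the full data. The names `wine_small` and `rank_means` are illustrative.

```r
# First ten rows of the relevant columns, copied from the str() output
wine_small <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed bins: 3-4 = Bad, 5 = Average, 6 = Good, 7-8 = perfect
wine_small$quality_rank <- cut(wine_small$quality,
                               breaks = c(2, 4, 5, 6, 8),
                               labels = c("Bad", "Average", "Good", "perfect"))

# Mean of each variable within every quality rank present in the sample
rank_means <- aggregate(cbind(alcohol, volatile.acidity, sulphates) ~ quality_rank,
                        data = wine_small, FUN = mean)
rank_means
```

With so few rows, ranks that do not occur (here Bad) are simply absent from the result.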
Some wines contain a higher alcohol percentage by volume than others. However, we should also keep looking at acidity: wines with an alcohol percentage above 10% tend to have a low amount of acetic acid.
Quality rank also depends on the sulphates (potassium sulphate) value, which is a somewhat unexpected result. Looking at sulphates vs quality level, higher-quality wines contain more sulphates. In other words, the higher the sulphates level, the higher the quality of the wine, which points to a positive correlation between sulphates and quality.
From the plot of alcohol against quality rank, we can see that although the data is widely scattered, there is a pattern: the higher the alcohol percentage, the higher the quality of the red wine.
The three final plots show wine quality in terms of several characteristics. They also show that wines at the same quality level may have different proportions of sulphates, acidity, and alcohol.
It was interesting to explore this dataset. I found it fascinating to determine which characteristics make wine taste good, since I don't drink.
During my analysis I faced some difficulties, such as unfamiliar R functions, which I worked through using the manual, blogs, books, etc.
Possible future research: I am going to continue exploring the dataset and apply different methods of analysis, studying the different types of wine and the characteristics that make a taste special for any individual.